Account for inaccurate offsets in getXrefData() #692

GreyWyvern · 2024-03-14T16:49:34Z

Type of pull request

Bug fix (involves code and configuration changes)

About

Normally offset pointers to xref keywords in a PDF document are exact to the byte. However, in some cases the pointer may point to some whitespace before the xref keyword. Adobe Acrobat takes these 'errors' in stride, displaying the document anyway, and so should PdfParser.

Clean up the getXrefData() function in RawDataParser.php. It now only needs to do one preg_match_all() and pushes the caret past whitespace when looking for xref keywords.
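The whitespace tolerance described above could look roughly like the sketch below. The helper names `skipPdfWhitespace()` and `pointsAtXrefKeyword()` are hypothetical and are not taken from RawDataParser.php; the sketch only assumes that "whitespace" means the six PDF whitespace bytes.

```php
<?php
// Hypothetical helper, not the actual PdfParser code: advance an offset past
// any PDF whitespace so a slightly-early pointer still lands on the keyword.
function skipPdfWhitespace(string $pdfData, int $offset): int
{
    // NUL, tab, line feed, form feed, carriage return, space.
    return $offset + strspn($pdfData, "\x00\x09\x0a\x0c\x0d\x20", $offset);
}

// Hypothetical check: does the (possibly inexact) offset lead to "xref"?
function pointsAtXrefKeyword(string $pdfData, int $offset): bool
{
    $offset = skipPdfWhitespace($pdfData, $offset);

    return 'xref' === substr($pdfData, $offset, 4);
}

// "xref" starts at byte 8; bytes 6 and 7 are line feeds.
$pdfData = "endobj\n\nxref\n0 1\n";

var_dump(pointsAtXrefKeyword($pdfData, 8)); // exact pointer
var_dump(pointsAtXrefKeyword($pdfData, 7)); // pointer one byte early
var_dump(pointsAtXrefKeyword($pdfData, 0)); // pointer at "endobj"
```

The first two calls succeed and the third fails, because only whitespace is skipped: an offset that lands on unrelated content is still rejected.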

Use existing Issue557.pdf to create a new file: Issue673.pdf where the last /Prev 13486 command has been decremented to /Prev 13485. Trying to parse this file would cause an Exception without this fix. Resolves #673.
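For illustration, a fixture like this could be derived by rewriting only the last `/Prev` entry in the raw bytes. This is a sketch under assumptions: `decrementLastPrev()` is a hypothetical helper, the sample string stands in for real PDF data, and the PR does not say how Issue673.pdf was actually produced.

```php
<?php
// Hypothetical helper: decrement the offset in the LAST "/Prev <n>" entry only.
// Assumes the decremented number has the same digit count as the original
// (13486 -> 13485), so no other byte offsets in the file are shifted.
function decrementLastPrev(string $pdfData, int $oldOffset): string
{
    $needle = '/Prev '.$oldOffset;
    $pos = strrpos($pdfData, $needle);
    if (false === $pos) {
        return $pdfData; // entry not found, leave the data untouched
    }

    return substr_replace($pdfData, '/Prev '.($oldOffset - 1), $pos, strlen($needle));
}

// Stand-in for real PDF bytes with two trailers pointing at the same xref.
$sample = "trailer << /Prev 13486 >>\ntrailer << /Prev 13486 >>\n";
$modified = decrementLastPrev($sample, 13486);

echo $modified;
```

Only the second trailer changes, and the total length stays the same, which matters because every other offset in a PDF depends on byte positions.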

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests cover these code parts as well.
Please run PHP-CS-Fixer before committing, to conform to our coding styles. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding styles.
In case you fix an existing issue, please do one of the following:
- Write in this text something like fixes #1234 to outline that you are providing a fix for the issue #1234.

k00ni

Thank you @GreyWyvern, good work! I have just a few remarks/questions.

src/Smalot/PdfParser/RawData/RawDataParser.php

No need to use `PREG_OFFSET_CAPTURE` here.
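As background for this remark, the sketch below shows what the flag changes in the result of `preg_match_all()`. The pattern and sample data are illustrative assumptions, not the expression used in getXrefData().

```php
<?php
// Illustrative xref-like data and pattern; neither is taken from PdfParser.
$data = "xref\n0 2\n0000000000 65535 f \n0000000017 00000 n \n";
$pattern = '/([0-9]+)[\x20]([0-9]+)[\x20]([nf])/';

// Without the flag, each capture is a plain string.
preg_match_all($pattern, $data, $plain, PREG_SET_ORDER);

// With the flag, each capture becomes a [string, byteOffset] pair.
preg_match_all($pattern, $data, $withOffsets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

var_dump($plain[0][1]);       // string
var_dump($withOffsets[0][1]); // array: matched string plus its byte offset
```

If the calling code only reads the matched strings, the extra nesting adds nothing, which is presumably why the flag could be dropped.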

k00ni

Sorry for the delay. I merged in a recent change to get rid of the coding style issue.

k00ni added the fix label Mar 25, 2024

k00ni requested changes Mar 25, 2024

src/Smalot/PdfParser/RawData/RawDataParser.php (resolved)

src/Smalot/PdfParser/RawData/RawDataParser.php (resolved)

GreyWyvern and others added 2 commits March 25, 2024 16:22

Drop unnecessary PREG_OFFSET_CAPTURE

45f7e53

No need to use `PREG_OFFSET_CAPTURE` here.

Merge branch 'master' into xref-offset

f880ced

k00ni approved these changes Apr 2, 2024

k00ni merged commit fb77eab into smalot:master Apr 2, 2024
29 checks passed

GreyWyvern deleted the xref-offset branch May 10, 2024 15:10
